1 Figure 1 - QC

1.1 CpG recovery per single cell

## [1] 67
## [1] 93

1.2 Percent of reads aligned

## Source: local data frame [3 x 2]
## 
##         state  mean(map)
##         (chr)      (dbl)
## 1         NTC  0.3383333
## 2 Single-cell 30.3183333
## 3    Ten-cell 40.6050000

1.3 qPCR

## Source: local data frame [3 x 2]
## 
##         state mean(conversion_rate)
##         (chr)                 (dbl)
## 1         NTC              95.71250
## 2 Single-cell              97.37414
## 3    Ten-cell              96.95000

1.4 SC library CpG recovery rate

## Source: local data frame [3 x 2]
## 
##         state mean(cpg_count)
##         (chr)           (dbl)
## 1         NTC        3298.167
## 2 Single-cell      401970.006
## 3    Ten-cell     2328151.333

1.5 Double-checking copy-number patterns (last QC step)

## [1] "Loading CNV Data"

## [1] 62
## [1] 86

1.6 SC average methylation

  • Read in single-cells and remove cells within filters

  • Read-in regions bed files and Calculate average methylation

## 
Read 81.0% of 3494995 rows
Read 3494995 rows and 6 (of 6) columns from 0.128 GB file in 00:00:03
  • Merge all Enhancers and removing duplicate enhancers
## <environment: R_GlobalEnv>
  • Plot regional methylation

1.6.0.1 for martin’s tenure pres

1.6.0.2 Find outliers

1.7 Clustering on CpG methylation

1.8 Does conversion rate affect genome-wide methylation?

1.9 Bimodal CpG distribution

2 Figure 2 - Exploring blood enhancers

2.1 What is the distribution of enhancer methylation?

2.1.1 Single-cell resolution

2.1.2 Bulk resolution

2.1.3 Distribution of hypo/hyper

2.2 Finding levels of h3k4me1 for each enhancer, split by their methylation state

## Joining by: "id"
## Joining by: "cell"

2.3 Calculating significance values between hypo/hyper enhancers

2.4 Checking if “heterogenous” enhancers (as determined by SD of cell enhancers) are located in active, partial, or inactive enhancers

3 Figure 3 - Heterogenous CpGs

Calling CpG resolution hetereogeneity

3.1 Plot CpG heterogeneity

  • Write tables of top heterogenous CpG’s

3.2 Looking at the genomic contexts where heterogenous CpG’s lie

4 Figure 4 - Pairwise comparisons

SC pairwise heterogeneity - Making comparisons and calculating heterogeneity

  • Plot heterogeneity distribution

  • Removed outliers ESLAM-GAGTGG and ESLAM-TTAGGC

4.1 Clustering in HSC reg networks

## Source: local data frame [2 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    97
## 2       2    49
## [1] "13 LSK in cluster | 62 Total LSK"
## [1] "84 ESLAM in cluster | 84 Total ESLAM"

4.2 Clustering genomewide

4.2.1 Genomewide all cells

## Source: local data frame [3 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    69
## 2       2    46
## 3       3    31
## [1] "4 LSK in cluster | 62 Total LSK"
## [1] "27 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [3 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    69
## 2       2    46
## 3       3    31
## [1] "24 LSK in cluster | 62 Total LSK"
## [1] "45 ESLAM in cluster | 84 Total ESLAM"

4.2.2 Genomewide Eslam Only

4.2.3 Genomewide LSK only

4.3 Clustering in other regions

4.3.1 Clustering in Blood Enhancers

## Source: local data frame [3 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    56
## 2       2    13
## 3       3    77
## [1] "18 LSK in cluster | 62 Total LSK"
## [1] "38 ESLAM in cluster | 84 Total ESLAM"

4.3.2 Clustering in blood lin DMR’s

## Source: local data frame [3 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    27
## 2       2    33
## 3       3    86
## [1] "12 LSK in cluster | 62 Total LSK"
## [1] "21 ESLAM in cluster | 84 Total ESLAM"

4.3.3 Clustering in CpG Islands

## Source: local data frame [6 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    46
## 2       2    33
## 3       3    32
## 4       4    28
## 5       5     6
## 6       6     1
## [1] "10 LSK in cluster | 62 Total LSK"
## [1] "18 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [6 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    46
## 2       2    33
## 3       3    32
## 4       4    28
## 5       5     6
## 6       6     1
## [1] "18 LSK in cluster | 62 Total LSK"
## [1] "14 ESLAM in cluster | 84 Total ESLAM"

4.3.4 Clustering in CpG Island shores

## Source: local data frame [4 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    57
## 2       2    40
## 3       3    48
## 4       4     1
## [1] "19 LSK in cluster | 62 Total LSK"
## [1] "21 ESLAM in cluster | 84 Total ESLAM"

4.3.5 Clustering in UMRs

## Source: local data frame [6 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    45
## 2       2    33
## 3       3    34
## 4       4    31
## 5       5     2
## 6       6     1
## [1] "13 LSK in cluster | 62 Total LSK"
## [1] "20 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [6 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    45
## 2       2    33
## 3       3    34
## 4       4    31
## 5       5     2
## 6       6     1
## [1] "20 LSK in cluster | 62 Total LSK"
## [1] "25 ESLAM in cluster | 84 Total ESLAM"

4.3.6 Clustering in promoters

  • 2kb up 500bp down
## Source: local data frame [3 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    53
## 2       2    92
## 3       3     1
## [1] "35 LSK in cluster | 62 Total LSK"
## [1] "57 ESLAM in cluster | 84 Total ESLAM"

4.3.7 Clustering in gene bodies

## Source: local data frame [5 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    52
## 2       2    55
## 3       3    27
## 4       4    11
## 5       5     1
## [1] "6 LSK in cluster | 62 Total LSK"
## [1] "21 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [5 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    52
## 2       2    55
## 3       3    27
## 4       4    11
## 5       5     1
## [1] "25 LSK in cluster | 62 Total LSK"
## [1] "30 ESLAM in cluster | 84 Total ESLAM"

4.4 Negative control - Clustering in cortex enhancers

## Source: local data frame [4 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1   122
## 2       2    10
## 3       3     9
## 4       4     5
## [1] "55 LSK in cluster | 62 Total LSK"
## [1] "67 ESLAM in cluster | 84 Total ESLAM"

5 Figure 5 - Epigenetic potential of subclusters

5.1 Look at distribution of clusters (how many overlap)

## [1] 1
  • Define subcluster of purity as cells that fall into clusters

  • Grab aggregated CpG per cell populationand group cells belonging to the same “cluster” together in silico using BSmooth

## 
Read 17.2% of 21867550 rows
Read 41.4% of 21867550 rows
Read 65.4% of 21867550 rows
Read 89.3% of 21867550 rows
Read 21867550 rows and 3 (of 3) columns from 0.486 GB file in 00:00:06
## Error in `[.data.table`(.SD, , 3:12, with = FALSE): j out of bounds
## [BSmooth] preprocessing ... done in 19.1 sec
## [BSmooth] smoothing by 'chromosome' (mc.cores = 20, mc.preschedule = FALSE)
## [BSmooth] smoothing done in 1311.1 sec
  • Find DMRs from in silico bulk groups

  • Call DMRs

Criteria - differentially methylated CpG’s (dmCpG) are called if they differ by (a statistical difference).

## [1] -0.2703689  0.2749154

Then, nearby dmCpG’s are grouped together into DMRs if they are within some distance of each other.

## [1] 15617
  • Annotate DMRs with HOMER (nearest gene)
## Joining by: c("chr", "bin", "start", "end")

There are 14104 DMRs between pure and unpure

5.2 Analyize where the DMRs are

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Filter DMRs that lie within regions implicated in transcription control

Defining activating and repressing regions as follows:

Negative Correlating regions (less methylation = more expression)

## [1] "3009 of 14104 DMRs lie in regions of epigenetic potential"
## Source: local data frame [3 x 2]
## 
##     Annotation     n
##          (chr) (int)
## 1        exon1    66
## 2   Intergenic  1689
## 3 promoter-TSS  1254
## [1] "124 genes that associate with both activated and repressed regions, out of 2382 total associated genes, are discarded (along with their regions)"
## Source: local data frame [2 x 2]
## 
##            state     n
##            (chr) (int)
## 1   hypo_in_pure  1301
## 2 hypo_in_unpure  1396
  • Call DMRs between LSK and ESLAM as control

Criteria - differentially methylated CpG’s (dmCpG) are called if they differ by (a statistical difference).

## [1] -0.2519766  0.2411145
## Joining by: c("chr", "bin", "start", "end")
## [1] 21583
## [1] "4284 of 21583 LSK/ESLAM DMRs lie in regions of epigenetic potential"
## Source: local data frame [3 x 2]
## 
##     Annotation     n
##          (chr) (int)
## 1        exon1    85
## 2   Intergenic  2568
## 3 promoter-TSS  1631
## [1] "237 genes that associate with both LSK and ESLAM regions, out of 3109 total associated genes, are discarded (along with their regions)"
## Source: local data frame [2 x 2]
## 
##           state     n
##           (chr) (int)
## 1 hypo_in_eslam  1772
## 2   hypo_in_lsk  1901

5.3 Heatmap of DMRs

5.3.1 heatmap of methylation within DMRs between pure and unpure

5.3.2 heatmap of methylation within DMRs between lsk and eslam

5.4 Find Gene Sets

5.4.0.1 Grabbed genes with gene ontology annotations:

  • GO:0002244 - hematopoietic progenitor cell differentiation
  • GO:0030098 - lymphocyte differentiation (leukocyte contains macrophages)
  • GO:0002573 - myeloid leukocyte differentiation (myeloid differentiation included erythrocytes as well, so had to zoom)
  • GO:0030097 - hemopoiesis in general
  • GO:0048568 - embryonic organ development
  • GO:0048598 - embryonic morphogenesis

See here for a visualization of these terms

5.4.0.2 Browse through ido amit’s (blood enhancer) paper for cell-specific expressed genes

“First, we estimated whether each transcription factor was expressed in each cell type. We used RNA-seq measurements, defining “expressed” as having more than 5 reads per million tags, and higher than 20% of its maximal expression levels across all cells.”

  • I think this means that the highest is >5 and is >20% than 2nd highest

5.4.0.3 Browse through other papers

DNMT3a and 3b double-KO paper

http://www.cell.com/cell-stem-cell/abstract/S1934-5909(14)00266-5

Summary

Summary

  • Mutant = less differentiation and more self renewal
  • Upregulated or downregulated is mutant vs wt (UP = more in Mutant)

Single-cell HSC transcriptome

http://link.springer.com/article/10.1186/s13059-015-0739-5/fulltext.html

Single-cell RNA-seq of HSCs reveal transcriptional heterogeneity largely explained by cell cycle related genes

  • note, probably won’t see anything because “DNA methylation is stable during replication and cell cycle arrest”

Aging-related genes

http://www.ncbi.nlm.nih.gov/pubmed/24792119

5.5 Calculate pvalues for each gene set and generate plot

  • Considering DMRs within promoter-TSS, exon1, Intergenic regions
  • difference in mCpG >3 sds
  • minimum 3 CpGs within 300 of each other
  • 210^{4} distance to TSS
  • 1 maximum meth in promoter-TSS and exon1
  • 1 maximum meth in Intergenic
  • Slope cutoff at 0.1

5.6 GREAT Analysis of DMRs

5.6.1 Analysis of all DMRs

  • Only plot top 30

  • Text analysis of all terms < 0.05 Hypergeometric FDR

5.6.2 Analysis of population-specific regulatory DMRs

  • ie in promoter, exon1, and intergenic

Pure = cluster 1 = the “close together” one = Orange

Unpure = cluster 2 = the more “spread out” one = Green

Using GREAT tool with settings Basal+extension (constitutive 5 kb upstream and 1 kb downstream, up to 20 kb max extension)

5.6.2.1 Checking the Methylation enriched TF terms

Some terms popped up where promoter motifs were enriched for certain TFs - check the methylation around the promoter of said TF to guestimate if expression of that TF is different

  • Orange = “pure” cluster 1
  • Green = “less pure” cluster 2
  • Grey = “unpure” no cluster
  • Red = ESLAM
  • Blue = LSK

5.6.2.2 Identifying probes

  • Promoter-centric view

  • Total affected view

5.7 Checking the promoter methylation of SuMO specific genes in pure vs differentiated cells

SuMO cells are ~60% pure

From table S4, top positive and negative correlated genes (http://www.sciencedirect.com/science/article/pii/S1934590915001629)

(genomic coordinates obtained from ensTrans genome browser, mm10)

Heatmap of all “genes positively correlated in SuMO cells” - basically genes that are higher expressed in SuMO vs rest.

## Joining by: "V1"

6 Figure 6 - Single-cell RNA-seq validation

  • Read table
ID HSC1_1 HSC1_2 HSC1_3 HSC1_4 HSC1_5 HSC1_6 HSC1_7 HSC1_8
0610005C13Rik 0 0 0 0 0 0 0 0
0610007P14Rik 25 0 304 0 154 550 117 177
0610008F07Rik 0 0 0 0 0 0 0 0
0610009B14Rik 0 0 0 0 0 0 0 0
0610009B22Rik 0 0 0 8 83 521 236 57
0610009D07Rik 241 0 66 60 356 173 385 64

6.1 RNAseq QC

6.1.0.1 Distribution of total read counts in single HSCs

6.1.0.2 Percentage of reads in mitochondrial genes

6.1.0.3 Total number of genes detected

## [1] 92

6.2 Expression of differentially expressed genes in SuMO vs non-SuMO

  • Euclidean distance and ward.D2 linkage method
## Joining by: "cell"

## initial  value 29.218440 
## iter   5 value 24.917247
## iter  10 value 24.072466
## final  value 24.004633 
## converged
## Joining by: "cell"

6.3 Explore genes with biological variation above technical variation

  • This is reported by the paper - ~4500 genes with variation greater than technical variation

  • Instead, perform a one-sided test against the null hypothesis that the true variance is at most the technical variation

## [1] 0.7371249

## 
## FALSE  TRUE 
## 20106  1025

We get 1025 DE genes wthat fail the null hypothesis where the CV is less than technical

6.4 Explore expression of genes that are differentially methylated

6.4.1 How many promoter DMR-associated genes overlap highly variable genes?

## [1] "44 DMR-associated genes not present in SC-RNAseq data"
V1 V2
total_RNAseq_genes 19997
DMR_associated_genes 469
Differentially_Expressed 894
Overlap 33
## [1] "The p-value of this overlap is 0.00703898464946137"

6.4.2 What is the average expression of DMR-associated genes?

## [1] "The probability that the distributions of expression of DMR genes and all genes are equal is 1.09055764392958e-11 using a Mann-Whitney non-parametric test."
## [1] "63 genes with 0 average expression across single cells"

6.4.3 Gene expression of DMR-associated genes

6.5 Heatmaps for pairwise clustering

6.5.1 heatmap of DMR-associated genes

  • Heatmap of SCs based on 1-spearman correlation for distance
## Source: local data frame [3 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    49
## 2       2    35
## 3       3     8
## Source: local data frame [3 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    49
## 2       2    35
## 3       3     8

## [1] "35 HSCs within cluster 1 out of 92 HSCs"
## [1] "49 HSCs within cluster 2 out of 92 HSCs"
  • Multidimensional scaling

6.5.2 Heatmap with all DE genes

  • Heatmap of SCs based on 1-spearman correlation for distance
## Source: local data frame [5 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    41
## 2       2    34
## 3       3     2
## 4       4    11
## 5       5     4
## Source: local data frame [5 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    41
## 2       2    34
## 3       3     2
## 4       4    11
## 5       5     4

## [1] "41 HSCs within cluster 1 out of 92 HSCs"
## [1] "34 HSCs within cluster 2 out of 92 HSCs"
  • Multidimensional scaling

6.5.3 Control heatmap with DE brain genes

## Source: local data frame [4 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    35
## 2       2    22
## 3       3    32
## 4       4     3
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"
## Source: local data frame [4 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    35
## 2       2    22
## 3       3    32
## 4       4     3
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"

## [1] "22 HSCs within cluster 1 out of 92 HSCs"
## [1] "32 HSCs within cluster 2 out of 92 HSCs"
  • Multidimensional scaling

6.5.4 Control heatmap with DE liver development genes

## Source: local data frame [4 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    25
## 2       2    29
## 3       3    13
## 4       4    25
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"
## Source: local data frame [4 x 2]
## 
##   cluster     n
##     (int) (int)
## 1       1    25
## 2       2    29
## 3       3    13
## 4       4    25
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"

## [1] "25 HSCs within cluster 1 out of 92 HSCs"
## [1] "25 HSCs within cluster 2 out of 92 HSCs"
  • Multidimensional scaling

6.6 Find DE genes between single-cells (using edgeR)

## Design matrix not provided. Switch to the classic mode.

##   Higher expressed in Pure Number of DE genes
## 1                    FALSE                  2
## 2                     TRUE                 51

No DE genes found using clusters defined by differentially expressed genes

## [1] "The p-value of the overlap between genes with biological variation and DE genes between groups is 0.015"
## Joining by: "gene"
## [1] "Genes that are DE between groups AND have promoter DMRS:  "

8 Comparison of in-silico bulk LSK and ESLAM cells

8.1 LSK vs ESLAM DMRs compared to previously published work

Compared to http://www.sciencedirect.com/science/article/pii/S1934590914003014, where they studied HSC -> MPP

8.1.1 Profiles of DMRs

8.1.1.1 Gain/Loss of DMRs

Gain = increase in methylation from HSC -> LSK

8.1.1.2 Distribution of DMRs

8.1.2 Overlap with genomic regions

8.1.2.1 With blood enhancers

## [1] "Binomial p-value of this overlap is 0"
## [1] "The fold enrichment is 1.27"

8.1.2.2 With TFBS

## [1] "Binomial p-value of this overlap is 0"
## [1] "The fold enrichment is 1.75"

8.1.3 Compare vs specific Paper defined regions

8.1.3.1 Gata 2

8.1.3.2 Sfpi1 (aka PU.1, aka Spi1)

8.1.3.3 HoxB4

8.1.3.4 HoxA7

8.1.3.5 H19 lncRNA

8.2 GREAT analysis of LSK vs ESLAM DMRs

8.2.1 All DMRs

8.2.2 Analysis of population-specific regulatory DMRs

  • ie in promoter, exon1, and intergenic

Using GREAT tool with settings Basal+extension (constitutive 2 kb upstream and 0.5 kb downstream, up to 20 kb max extension)

9 Done